Level up your optimization process: 
how to implement distributed 
profiling and why you want to have it 
About me


* Developer on the Yandex infrastructure advertising team
* Develop a system that serves one million requests per second
* Specialize in data structure optimization and big data analysis
* Develop a distributed profiling system


Contact me 
in Telegram: 
Profiling


* A profiler is a performance analysis tool for applications
* Examples: GNU gprof, perf


Distributed profiling 


* A distributed profiler is a performance analysis tool for distributed applications that aggregates data from multiple hosts


Flame graphs


Motivation 




1. Exploration of targets for optimization 


2. Complexity of using conventional local profilers 


[Screenshot: the "perf Examples" web page for the Linux perf profiler (perf_events), showing its introduction and a table of contents with 13 sections, from one-liners and events to flame graphs and troubleshooting]


3. Waste of time on every request 


4. Difficulty of historical comparison


5. Impact on the performance of the application being profiled


* You start to profile a host
* Only one host is profiled, so it has to be sampled frequently
* The host degrades in comparison with other hosts
* Load balancers try to avoid the host
* You have measured a degraded host with an unrepresentative workload
Theory


Sampling and instrumentation profilers 


Sampling profiler
* perf
* OProfile


Instrumentation profiler
* Manual
* Automatic source level
* Intermediate language
* Compiler assisted
* Binary translation
* Runtime instrumentation
* Runtime injection


Instrumentation profilers' limitations 


* Performance changes: stack trace writing is too expensive 


* Heisenbugs: writing a stack trace may change the execution order 


Sampling profiler math 
Sampling profiler math: definitions


1. The profiler pauses the program at a random point and prints a stack trace.
   $X$ := the printed stack.
2. The sample space of the random variable $X$ is the set of all possible stacks:

\[ \mathcal{S} = \{ S \mid S \text{ is a stack in the program} \} \]


Definition
Indicator random variable for a stack $S$:

\[ I(X) = \begin{cases} 1, & X = S \\ 0, & X \neq S \end{cases} \]
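Sampling profiler math: asymptotic distribution


1. Let $X_1, \dots, X_N$ be independent samples, and let $\mu$ and $\sigma^2$ be the mean and variance of $I(X)$. Consider the following sequence:

\[ Z_N = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \bigl( I(X_i) - \mu \bigr) \]

2. According to the central limit theorem:

\[ \lim_{N \to \infty} Z_N \sim \mathcal{N}(0, \sigma^2) \]

3. Asymptotically:

\[ \frac{1}{N} \sum_{i=1}^{N} I(X_i) = \frac{Z_N}{\sqrt{N}} + \mu \sim \mathcal{N}(\mu, \sigma^2 / N) \]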
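The sample mean of the indicator is what a sampling profiler reports as the share of a stack. A minimal C++ sketch of that estimate follows. The names `TShareEstimate` and `EstimateShare` are hypothetical, stacks are assumed to be stored as plain strings, and the standard error plugs the estimated share into $\sigma^2 = \mu (1 - \mu)$, which holds for an indicator variable.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Estimated share of samples in which a given stack was observed,
// together with the standard error of that estimate.
struct TShareEstimate {
    double Share;          // sample mean of the indicator I(X_i)
    double StandardError;  // sqrt(Share * (1 - Share) / N), a plug-in estimate
};

// `samples` holds one printed stack per profiler pause (assumed representation);
// `stack` is the stack S whose share is being estimated.
inline TShareEstimate EstimateShare(const std::vector<std::string>& samples,
                                    const std::string& stack) {
    assert(!samples.empty());
    std::size_t hits = 0;
    for (const auto& sample : samples) {
        if (sample == stack) {
            ++hits;  // the indicator equals 1 for this sample
        }
    }
    const double n = static_cast<double>(samples.size());
    const double share = static_cast<double>(hits) / n;
    return TShareEstimate{share, std::sqrt(share * (1.0 - share) / n)};
}
```

With four samples of which three match, the estimate is 0.75 and the standard error is large; it shrinks as $1/\sqrt{N}$ when more samples are collected.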
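Sampling profiler math: relative error and examples


1. Relative error of the sample mean $\bar{X}(N) = \frac{1}{N} \sum_{i=1}^{N} I(X_i)$:

\[ \varepsilon = \bar{X}(N) / \mu - 1 \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{N \mu^2}\right) \]

2. For an indicator variable, $\sigma^2 = \mu (1 - \mu) \approx \mu$ when $\mu$ is small. Requiring one standard deviation of the relative error to equal $\varepsilon$, or three standard deviations according to the three-sigma rule:

\[ N \approx \frac{1}{\varepsilon^2 \mu}, \qquad N_{3\sigma} \approx \frac{9}{\varepsilon^2 \mu} \]

3. Particular solutions:


Examples


1. $\varepsilon = 0.1$, $\mu = 0.01$ $\Rightarrow$ $N = 10'000$ ($3\sigma$ case: $90'000$)
2. $\varepsilon = 0.1$, $\mu = 0.001$ $\Rightarrow$ $N = 100'000$ ($3\sigma$ case: $900'000$)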
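The examples above can be reproduced with a few lines of C++. This is a sketch under the small-share approximation $\sigma^2 \approx \mu$; the function name `RequiredSamples` and the `sigmas` parameter are hypothetical and not part of the talk.

```cpp
#include <cassert>
#include <cmath>

// Number of samples needed so that `sigmas` standard deviations of the
// relative error equal `epsilon`, for a stack whose true share is `mu`.
// Uses the approximation sigma^2 = mu * (1 - mu) ~= mu, valid for small mu.
inline long long RequiredSamples(double epsilon, double mu, double sigmas = 1.0) {
    assert(epsilon > 0.0 && mu > 0.0 && mu < 1.0 && sigmas > 0.0);
    // Rounding absorbs floating-point noise in the division.
    return std::llround(sigmas * sigmas / (epsilon * epsilon * mu));
}
```

Calling `RequiredSamples(0.1, 0.01)` and `RequiredSamples(0.1, 0.01, 3.0)` gives the first example's two numbers. The required count grows inversely with the share, so rare stacks need far more samples than a single host can provide without frequent sampling.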
How it works


General scheme 
Stack extractors: poor man's profiler


For more information on Poor Man's Profiler


gdb.cmd:
set pagination 0
# print_ prints the backtrace of the current thread
define print_
    bt
end
thread apply all print_
quit


bash:
gdb --quiet --batch -x gdb.cmd --pid "$PID" > result.txt


Stack extractors: signal-based profilers 


Conference talk: «Query Profiler: The Difficult Path» (in Russian)


Signal-based profiler library


Stack marking 


Service | Stack                                            | Timestamp  | Experiments       | Count
worker  | all                                              | 1500000000 | AddSpecialFeature | 1
        | DoRunNaked()                                     |            | StoreLessData     |
        | FiberMain                                        |            |                   |
        | RunInFiberContext                                |            |                   |
        | TPrioritizedInvoker::DoExecute                   |            |                   |
        | THttpHandler::HandleCriticalRequest()            |            |                   |
        | CalculateMillionthFibbonacciNumberRecursively()  |            |                   |
        | CalculateMillionthFibbonacciNumberRecursively()  |            |                   |


Stack marking: Code 


#include <string>

namespace {
    // Keeps the mark of the request currently handled by each thread.
    class RequestInfoKeeper {
        static thread_local std::string Request;
    public:
        static void SetRequest(const std::string& request) {
            Request = request;
        }

        static const std::string& GetRequest() {
            return Request;
        }
    };

    thread_local std::string RequestInfoKeeper::Request("default");
}

// Free functions that expose the mark, e.g. to `print GetRequest()` in gdb.
const std::string& GetRequest() {
    return RequestInfoKeeper::GetRequest();
}

void SetRequest(const std::string& request) {
    RequestInfoKeeper::SetRequest(request);
}
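One way to use such a mark in request-handling code is an RAII guard, so that the mark is always restored when the handler returns. The sketch below is an illustration only: `TRequestMarkGuard` and `HandleMarked` are hypothetical names that do not come from the talk, and the per-thread mark is reduced to a single `thread_local` string to keep the block self-contained.

```cpp
#include <cassert>
#include <string>

namespace {
    // Per-thread request mark, a simplified stand-in for RequestInfoKeeper.
    thread_local std::string CurrentRequest("default");
}

const std::string& GetRequest() { return CurrentRequest; }
void SetRequest(const std::string& request) { CurrentRequest = request; }

// Hypothetical helper: marks the current thread for the lifetime of the
// guard and restores the previous mark when the guard is destroyed.
class TRequestMarkGuard {
public:
    explicit TRequestMarkGuard(const std::string& request)
        : Previous_(GetRequest())
    {
        SetRequest(request);
    }

    ~TRequestMarkGuard() {
        SetRequest(Previous_);
    }

    TRequestMarkGuard(const TRequestMarkGuard&) = delete;
    TRequestMarkGuard& operator=(const TRequestMarkGuard&) = delete;

private:
    std::string Previous_;
};

// Hypothetical handler: any stack sampled while it runs carries the mark.
std::string HandleMarked(const std::string& request) {
    TRequestMarkGuard guard(request);
    assert(GetRequest() == request);
    return GetRequest();  // stand-in for the real work
}
```

Because the mark is thread-local, a guard on one thread does not affect the mark seen by samples taken from other threads.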




Stack marking: Gdb 


gdb.cmd:
set pagination 0
# print_ prints the request mark of the current thread, then its backtrace
define print_
    print GetRequest()
    bt
end
thread apply all print_
quit


bash:
gdb --quiet --batch -x gdb.cmd --pid "$PID" > result.txt


Stack aggregation 


* Aggregation by a day 
* Partial sums 
Raw sample:

Service | Stack                                  | Timestamp  | Experiments       | Count
worker  | all                                    | 1500000000 | AddSpecialFeature | 1
        | DoRunNaked()                           |            | StoreLessData     |
        | FiberMain                              |            |                   |
        | RunInFiberContext                      |            |                   |
        | TPrioritizedInvoker::DoExecute         |            |                   |
        | THttpHandler::HandleCriticalRequest()  |            |                   |


A row keyed by stack id and day:

Service | StackId    | Date            | Experiments | Count
worker  | 4815162342 | Fri Jul 14 2017 | Default     |
Rows with the same key:

Service | StackId    | Date            | Experiments       | Count
worker  | 4815162342 | Fri Jul 14 2017 | AddSpecialFeature | 1
worker  | 4815162342 | Fri Jul 14 2017 | AddSpecialFeature | 1
worker  | 4815162342 | Fri Jul 14 2017 | AddSpecialFeature | 1


After summing the counts:

Service | StackId    | Date            | Experiments       | Count
worker  | 4815162342 | Fri Jul 14 2017 | AddSpecialFeature | 3
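The two ideas on these slides, aggregation by day and partial sums, can be sketched in C++ as follows. Everything here is an assumption made for illustration: the type and function names are hypothetical, the stack id is taken to be a hash of the stack text with `std::hash` as a stand-in, days are UTC day numbers, and each sample is expanded into one row per experiment.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// One raw sample: a marked stack observed `Count` times at `Timestamp`.
struct TSample {
    std::string Service;
    std::string Stack;                     // frames joined into one string
    std::int64_t Timestamp;                // Unix time in seconds
    std::vector<std::string> Experiments;
    std::uint64_t Count;
};

// Aggregation key: (service, stack id, UTC day number, experiment).
using TKey = std::tuple<std::string, std::uint64_t, std::int64_t, std::string>;
using TAggregate = std::map<TKey, std::uint64_t>;

// Assumption: the stack id is a hash of the stack text.
inline std::uint64_t StackId(const std::string& stack) {
    return std::hash<std::string>{}(stack);
}

// Adds samples to `total`, one row per experiment, bucketed by day.
inline void AddSamples(const std::vector<TSample>& samples, TAggregate& total) {
    constexpr std::int64_t SecondsPerDay = 86400;
    for (const auto& s : samples) {
        assert(s.Timestamp >= 0);
        for (const auto& experiment : s.Experiments) {
            const TKey key(s.Service, StackId(s.Stack),
                           s.Timestamp / SecondsPerDay, experiment);
            total[key] += s.Count;
        }
    }
}

// Partial sums: aggregates built separately, for example per host or per
// batch, are merged by adding the counts of equal keys.
inline void Merge(const TAggregate& partial, TAggregate& total) {
    for (const auto& row : partial) {
        total[row.first] += row.second;
    }
}
```

Because addition is associative, partial aggregates can be merged in any order, so each host or batch can be summed independently before the daily totals are combined. The timestamp 1500000000 from the table maps to day number 17361, which is 14 July 2017 in UTC.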


Conclusion 


* Profiling is important
* Profiling is complicated
* Distributed profiling may be a solution


Leave your feedback! 


You can rate the talk and give feedback on what you liked or what could be improved